Welcome to the music world!

First, let’s take a look at the artist data.

This plot shows the number of artists formed over the years. Based on this, we can see that a lot of artists were formed in the late 1990s. Is this trend going to be consistent with the lyrics data?

Before we start comparing, let’s clean up the lyrics data. I did the text processing with the given code. The processing goes through these steps:

1. Eliminate white space and stop words from the lyrics.
2. Stem the words and convert the tm object to a tidy object.
3. Create a tidy format of the dictionary to be used for completing stems.
4. Combine the stems and the dictionary into the same tibble.
5. Complete the stems.
6. Paste the stem-completed individual words back into their respective lyrics.
7. Keep track of the processed lyrics with their own ID.

After this process, I also added a character count, a word count, and a decade category. I also excluded NA values and two outliers with implausible year values.
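To make the cleaning and feature steps concrete, here is a minimal sketch. The analysis itself was done in R with tm and tidy tools; Python is used here only for illustration. The stop-word list, the record fields, and the function names are all assumptions, and stemming and stem completion are left out for brevity.

```python
import re

# Tiny illustrative stop-word list; an assumption, not the list used in the analysis.
STOP_WORDS = {"a", "an", "and", "the", "i", "you", "to", "of", "in", "is", "it"}


def clean_lyrics(text):
    """Lowercase, drop punctuation, collapse whitespace, remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def decade_label(year):
    """Map a year such as 1987 to a decade label such as '1980s'."""
    return f"{year // 10 * 10}s"


def add_features(song):
    """Return a copy of a song record with cleaned lyrics, counts and decade."""
    cleaned = clean_lyrics(song["lyrics"])
    return {
        **song,
        "cleaned": cleaned,
        "char_count": len(cleaned),
        "word_count": len(cleaned.split()),
        "decade": decade_label(song["year"]),
    }


# Hypothetical record, for illustration only.
song = {"id": 1, "year": 1987, "lyrics": "I  love you,\n and the   night is young"}
print(add_features(song))
```

Keeping the original ID on each record is what lets the processed lyrics be joined back to the artist and genre information later.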

We are now ready to look into the lyrics data!

Overview of the data

Let’s see how the number of songs changes over the years.

First, we can say that the lyrics data does not follow the same trend as the artist data. This plot shows a huge peak in 2006 and 2007, while the artist data peaks in the late 1990s. These two yearly counts look abnormally large, so let’s look into them.

According to this bar plot, the peak comes from the rise of Rock. We can also see the rise of Pop and Hip-Hop from the 1990s to the 2000s. It might seem weird to see the total number of songs decrease from the 2000s to the 2010s, but this is an artifact of the data: the most recent songs are from 2016, so the 2010s are incomplete.

Since the number of songs varies a lot from decade to decade, let’s look at the average length of a song instead.

From the plot above, we can clearly see that the average length of songs has increased over the decades.
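The averaging behind such a plot can be sketched as follows. This is only an illustration in Python (the analysis itself was done in R), assuming each song record carries the decade label and word count added during preprocessing. The sample records are made up.

```python
from collections import defaultdict


def mean_word_count_by_decade(songs):
    """Average the word count of the songs within each decade."""
    totals = defaultdict(lambda: [0, 0])  # decade -> [sum of word counts, number of songs]
    for song in songs:
        entry = totals[song["decade"]]
        entry[0] += song["word_count"]
        entry[1] += 1
    return {decade: total / n for decade, (total, n) in sorted(totals.items())}


# Hypothetical records, for illustration only.
songs = [
    {"decade": "1990s", "word_count": 100},
    {"decade": "1990s", "word_count": 200},
    {"decade": "2000s", "word_count": 250},
]
print(mean_word_count_by_decade(songs))
```

Using the mean per decade rather than the total removes the effect of some decades simply having more songs.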

Now, we want to see whether there is a correlation between the number of words and the number of characters. This could be an interesting question because recent Hip-Hop songs use a lot of short, meaningless words like skrt, ay, etc. These could significantly increase the word count without affecting the character count much.

However, these two charts show us that the number of words and the number of characters are highly correlated, and this is consistent across all genres.
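One way to put a number on "highly correlated" is the Pearson correlation coefficient. The sketch below is an illustration in Python (the analysis itself was done in R) and uses made-up counts. It is not the computation behind the charts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-song counts, for illustration only.
word_counts = [100, 200, 300, 400]
char_counts = [520, 980, 1510, 2050]
print(round(pearson(word_counts, char_counts), 3))
```

A value close to 1 means that songs with more words also have proportionally more characters. If very short filler words dominated a genre, the coefficient for that genre would be expected to drop.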

What if we exclude the influence of the Rock music?

Still, the number of songs increased significantly from the early 2000s. Thus, we can conclude that the music industry boomed starting in the early 2000s. Even without Rock, the number of Pop, Metal and Hip-Hop songs increased a lot.

We now have the general trend of the music industry. Wondering which words were used most frequently over 40 years across all genres?

Love was the most popular one! We can see a couple of big words, including love, time, youre, baby, hear, girl, etc. It might seem obvious, but many popular words are related to emotions or describe someone.
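A wordcloud is a picture of word frequencies, so the underlying counting can be sketched in a few lines. This is a Python illustration (the analysis itself was done in R) with made-up lyrics. It assumes stop words have already been removed.

```python
from collections import Counter

# Hypothetical cleaned lyrics (stop words already removed), for illustration only.
cleaned_lyrics = [
    "love time baby",
    "love girl hear",
    "time love night",
]


def top_words(lyrics, n):
    """Return the n most frequent words across all songs with their counts."""
    counts = Counter(word for song in lyrics for word in song.split())
    return counts.most_common(n)


print(top_words(cleaned_lyrics, 2))
```

The larger a word appears in the cloud, the higher its count in a table like this one.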

Let’s Rock n Roll

Since we’ve seen the impact of Rock, let’s look closely at Rock music. We just saw the overall popularity of words across all genres. What about Rock?

Of course, the Rock wordcloud shows some similar words, since Rock makes up a major portion of the data. We see more words relating to emotions and feelings. This makes me want to look into a sentiment analysis of the Rock lyrics.

The wordcloud shows both positive and negative words used in Rock songs. Looking at the positive words, you can definitely say these lyrics are used to describe the person you love. On the other hand, the negative words are mostly used to explain your feelings when you are heartbroken and how sad you are. You can easily create both a positive and a negative plot just from this wordcloud.

Just out of curiosity, let’s compare the positive/negative words with the Hip-Hop genre. (Assuming there will be more intense negative words.)

And…the assumption was right. We can clearly see that the Hip-Hop genre uses more intense negative words in its songs, including lots of swear words. Also, the positive words don’t seem to serve the same purpose as they do in the Rock lyrics. A lot of positive words are used to describe someone’s physique or to show off something.

The sentiment analysis above was based on the bing lexicon. However, this is not the only option: there are also AFINN and nrc. So, let’s compare the three sentiment lexicons.

The three lexicons give different values, as seen in the comparison plot. It might seem that AFINN differs from bing and nrc, but most of their trends are similar. AFINN values seem to fluctuate more than those of the other two methods. Overall, the absolute values are different, but they all have similar peaks and lows.
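One reason the absolute values differ is that the lexicons score words differently. bing labels each word as positive or negative, while AFINN assigns each word an integer score between -5 and 5. The sketch below illustrates the two styles of net sentiment in Python (the analysis itself was done in R). The mini lexicons are made-up stand-ins, not entries taken from the real lexicons.

```python
# Toy lexicons with made-up entries; labels and scores are assumptions for illustration.
BING_LIKE = {"love": "positive", "happy": "positive", "cry": "negative", "hate": "negative"}
AFINN_LIKE = {"love": 3, "happy": 3, "cry": -1, "hate": -3}


def net_sentiment_binary(words, lexicon):
    """Number of positive words minus number of negative words."""
    positive = sum(1 for w in words if lexicon.get(w) == "positive")
    negative = sum(1 for w in words if lexicon.get(w) == "negative")
    return positive - negative


def net_sentiment_scored(words, lexicon):
    """Sum of per-word scores; words missing from the lexicon count as 0."""
    return sum(lexicon.get(w, 0) for w in words)


words = "love love hate cry happy night".split()
print(net_sentiment_binary(words, BING_LIKE))   # count-based net sentiment
print(net_sentiment_scored(words, AFINN_LIKE))  # score-weighted net sentiment
```

Because a weighted score lets a single strong word move the total by several points, a scored lexicon can swing more than a count-based one. That may be part of why the AFINN curve fluctuates more.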

Now we know the general sentiments of the Rock lyrics. Next, we also want to see how the lyrics have developed. Did they become more complex or simpler? More diverse or more uniform?

Let’s look into the lexical diversity first.

We can see that the lexical diversity drastically increased in the 2000s. An interesting insight from this plot is that the distribution looks quite normal for the 2000s and 2010s, but it seems pretty flat for the 1970s to the 1990s. This might have happened due to the nature of the lyrics data, or lexical diversity may actually have increased starting in the 2000s with the booming music industry.

Now, on to the complexity of the lyrics. In order to measure complexity, we can use the readability package in R.
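For reference, here is a minimal sketch of how lexical diversity can be measured. It assumes diversity means the number of distinct words in a song, which is an assumption about the metric rather than something stated above. It also computes the type-token ratio, a length-normalized alternative. Python is used only for illustration (the analysis itself was done in R).

```python
def lexical_diversity(lyrics):
    """Return (number of distinct words, type-token ratio) for one song."""
    words = lyrics.lower().split()
    distinct = len(set(words))
    ratio = distinct / len(words) if words else 0.0
    return distinct, ratio


# Hypothetical lyric line, for illustration only.
print(lexical_diversity("love me love me do"))
```

If diversity is counted as distinct words, longer songs tend to score higher simply because they contain more words, so the type-token ratio is a useful cross-check.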

Readability score

          1970s     1980s     1990s     2000s     2010s
ARI       56.08218  64.98671  62.45764  58.86224  66.06284
Coleman   66.52581  65.83499  66.46525  66.09081  66.74829
Flesch    16.62471  35.29335  29.82373  22.87322  37.10190
RIX       10.26333  12.71278  11.45393  11.36770  12.19121

As we can see from the scores above, readability differs by method. ARI increases from the 1970s to the 1980s, decreases from the 1980s to the 2000s, and increases again in the 2010s. For ARI, a higher number means more complexity. RIX shows the same trend. Read as an approximate grade level, the RIX values suggest that 1970s songs were interpretable for 10th graders, while 2010s songs are understandable by 12th graders and up. Flesch values rise and fall in the same decades as ARI and RIX, although if this is the Flesch Reading Ease score, higher values mean easier text, so the direction of interpretation is reversed. On the other hand, the Coleman value stays quite stable across all decades. The Coleman method relies on characters instead of syllables per word.
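To show what goes into one of these scores, here is a sketch of the standard ARI formula, 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. It is written in Python for illustration only; the scores above come from an R package whose tokenization may differ. The splitting rules and sample lines are assumptions.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    # Simplified tokenization (an assumption): sentences end at . ! or ?,
    # and words are runs of letters or digits.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Same words, with and without sentence punctuation (hypothetical lines).
punctuated = "I love you. You love me. We sing all night."
unpunctuated = "I love you you love me we sing all night"
print(automated_readability_index(punctuated))
print(automated_readability_index(unpunctuated))
```

Removing the punctuation raises the score because the whole text is treated as one long sentence. Lyrics often have little sentence punctuation, which is one possible reason the ARI values above are so much higher than typical grade levels.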

Summary

  • The overall length of songs increased
  • The number of songs increased drastically starting in the 2000s
  • Rock music accounts for the largest share of songs
  • However, we can project the rise of Hip-Hop and Pop starting in the 2010s
  • Lexical diversity increased for Rock music
  • Complexity/readability of lyrics fluctuates by decade

For the future…

It would be great to wait for 2019 to end and do a full analysis of the 2010s data. I assume we are going to see a different trend from the 2000s to the 2010s because Rock music has lost popularity. The changes in Hip-Hop and Pop music would be interesting to look into.